Contingency table estimation via sketches

ABSTRACT

Systems and methods that enhance estimates of features (e.g., word associations) by employing a sampling component (e.g., sketches) that facilitates computation of sample contingency tables and designates occurrences (or absence) of features in data (e.g., words in document lists). The sampling component can further include a contingency table generator and an estimation component that employs a likelihood argument (e.g., partial likelihood, maximum likelihood, and the like) to estimate feature/word pair associations in the contingency tables.

CROSS-REFERENCE TO RELATED APPLICATIONS

This is an application claiming benefit under 35 U.S.C. 119(e) of U.S. Provisional Application No. 60/717,316, filed on Sep. 15, 2005, entitled “USING SKETCHES TO ESTIMATE CONTINGENCY TABLES”, the entirety of which is hereby incorporated by reference as if fully set forth herein.

BACKGROUND

Memory storage expansion and processing capabilities of computers have enabled massive amounts of data to be accumulated and analyzed by complex and intelligent algorithms. For instance, given an accumulation of data, algorithms can analyze such data and locate patterns therein. Such patterns can then be extrapolated from the data, persisted as content of a data mining model or models, and applied within a desired context. With the evolution of computers from simple number-crunching machines to sophisticated devices, numerous services are supplied for data trending and analysis.

Usage of such data analysis tools has increased dramatically as society has become more dependent on databases and similar digital information storage mediums. Such information is typically analyzed, or “mined,” to learn additional information regarding customers, users, products, and the like.

For example, data mining can be employed in searching through large amounts of data to uncover patterns and relationships contained therein. In the data mining world, there are at least two operations that are performed with data indicated by the client. These operations are training (finding patterns in client data) and prediction (applying such patterns to infer new/missing knowledge about client data). Moreover, data mining can be employed to explore large detailed business transactions, such as pairing up items for sale or “associative selling”, wherein businesses desire to correlate various product names based upon a particular buyer's buying habits. Such an associative process can also be expanded beyond direct product sales. It can be utilized indirectly to enhance search capabilities in conjunction with word queries.

Word associations (e.g., co-occurrences or joint frequencies) have a wide range of applications including Speech Recognition, Optical Character Recognition, and Information Retrieval (IR). Although associations can be readily computed for a small corpus, computing a plurality of scores for data as numerous as the Web can become a daunting challenge (e.g., billions of web pages and millions of word types). For example, for a small corpus, one could compute pair-wise (two-way) associations by multiplying the (0/1) term-by-document matrix with its transpose. Yet such an approach can become infeasible at Web scale. Furthermore, the computation and storage cost can increase exponentially for multi-way associations.

Although deriving associations among data (e.g., word search queries) is extremely advantageous, it is also generally very difficult to actually determine such associations. Typically, the difficulty in deriving such associations is in part due to factors such as: complex computing requirements, complexity in accessing and retrieving the necessary information, and/or long computational calculation times, and the like. In general, a process reviews the data and examines patterns in the data, along with the frequency with which the patterns appear. These patterns, in turn, facilitate determining “association rules”, which can be further analyzed to identify the likelihood of predicted outcomes, given a particular set of data.

For large amounts of data, the review process to determine association rules often requires searching entire document collections and employing large amounts of memory. It is common for all available memory to be utilized before all of the data has been reviewed. This causes decreased performance in operations of computer systems.

Estimates can provide a suitable approach to mitigate a requirement to examine every document to determine whether two words are strongly associated or not. Web search engines can produce estimates of page hits. For example, hits for two high frequency words “a” and “the” can yield a large number of web pages (D=10¹⁰ for English documents). Accordingly, one can employ estimated co-occurrences from a small sample to compute test statistics, most commonly Pearson's Chi-squared test, the likelihood ratio test, and Fisher's exact test, as well as some non-statistical metrics such as cosine similarity or resemblance, which are also widely used in computational Linguistics and Information Retrieval.
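By way of illustration only, the following Python sketch shows how Pearson's chi-squared statistic can be computed once the four cells of a two-by-two word-association table are available (whether counted exactly or estimated); the cell counts used here are hypothetical.

```python
# Pearson's chi-squared statistic for a 2x2 word-association table.
# a: documents containing both words, b: only the first word,
# c: only the second word, d: neither. Counts below are illustrative.
def chi_squared_2x2(a, b, c, d):
    n = a + b + c + d
    observed = [[a, b], [c, d]]
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed[i][j] - expected) ** 2 / expected
    return stat

print(chi_squared_2x2(a=50, b=950, c=1200, d=97800))
```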

The conventional sampling method randomly selects D_(s) documents from a collection of size D and counts the word co-occurrences within the sample. In terms of the term-by-document matrix, which has M rows (M an integer indicating the number of word types) and D columns, conventional sampling randomly selects a number (D_(s)) of columns. As such, all words are typically sampled at the same rate. For example, the sampling rate is not higher for words considered interesting to a user and lower for words considered less interesting to a user.

Therefore, there is a need to overcome the aforementioned exemplary deficiencies associated with conventional systems and devices.

SUMMARY

The following presents a simplified summary in order to provide a basic understanding of some aspects of the claimed subject matter. This summary is not an extensive overview. It is not intended to identify key/critical elements or to delineate the scope of the claimed subject matter. Its sole purpose is to present some concepts in a simplified form as a prelude to the more detailed description that is presented later.

The subject innovation provides for systems and methods that enhance estimate(s) of features (e.g., word associations), via employing a sampling component that facilitates computations of sample contingency tables, and designates occurrences (or absence) of features in data (e.g., words in document list(s)). The sampling component can include a non-random sampling feature to obtain a sketch or representation of data and/or words. Such a sketch can then be employed for constructing contingency tables, wherein an estimation component utilizes a likelihood argument (e.g., partial likelihood, maximum likelihood, and the like) to estimate features (e.g., word pair(s) associations) for generated contingency tables. Moreover, stopping rules for sample size selection can be based on such likelihood arguments.

Accordingly, a more general sampling method (e.g., with non-random sampling as opposed to conventional random sampling methods) can be provided, wherein the estimations associated therewith are based on likelihood. The sampling component can also employ a variable rate of sampling depending on search criteria (e.g., rareness or prominence of word usage). Thus, a typical requirement to examine an entire document (or list of documents) to determine whether two words are associated can be mitigated.

In one exemplary aspect, a contingency table in a sketch space is initially computed. Such a contingency table can include a plurality of cells that form a matrix of integers, which designate occurrences (or absence) of words within documents. Subsequently, a maximum likelihood argument can be employed to locate the most likely contingency table in the original space, while at the same time considering estimate associations and the already imposed constraints (e.g., presence or absence of words in a document, number of occurrences, frequency of occurrences, and the like). Therefore, an entire document list need not be examined, and associations between data can be readily determined via the sampling component.

According to a further aspect, the subject innovation constructs sample contingency table(s) (e.g., a matrix of integers) from sketches, thus connecting powerful sketch techniques with conventional statistical methods. As such, conventional statistical techniques (e.g., maximum likelihood estimation (MLE)) and large sample theory can be employed to analyze estimation errors (e.g., variances). Therefore, the contingency table construction of the subject innovation also enables statistical hypothesis testing (e.g., the X² test or G² test, or multiple testing, and the like).

In a related aspect, to estimate associations between word W₁ and word W₂, the subject innovation employs a likelihood function that leverages constraints such as: the size of the collection D (e.g., total number of documents in the collection); the margin (length of posting lists) f₁=a+b (wherein “a” is the number of documents that contain both W₁ and W₂, and “b” is the number of documents that contain W₁ but not W₂); the margin f₂=a+c (wherein “c” is the number of documents that contain W₂ but not W₁); and D=a+b+c+d (wherein “d” is the number of documents that contain neither W₁ nor W₂). Likewise, a_(s), b_(s), c_(s), d_(s) correspond to a sample contingency table in a sample space s. Various artificial intelligence components can also be employed in conjunction with estimating associations for the word pairs.

According to a further aspect, the subject innovation can employ an enhanced sampling (e.g., non-randomized sampling), in conjunction with maximum likelihood estimation, wherein a_(s)=a·s (s being a sampling rate). Hence, in contrast to conventional methods (e.g., a_(s)=a·s²), the subject innovation supplies a more dynamic range on a cell of a contingency table (e.g., a cell with the smallest counts). By enhancing the sampling procedure, the subject innovation can facilitate obtaining a better resolution, even though a more complex estimation can be required (e.g., a likelihood argument).

To the accomplishment of the foregoing and related ends, certain illustrative aspects of the claimed subject matter are described herein in connection with the following description and the annexed drawings. These aspects are indicative of various ways in which the subject matter can be practiced, all of which are intended to be within the scope of the claimed subject matter. Other advantages and novel features may become apparent from the following detailed description when considered in conjunction with the drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates a sampling component that operates on an original space of data, to form a sketch space.

FIG. 2 illustrates estimate tables from margins and samples in accordance with an aspect of the subject innovation.

FIG. 3 illustrates an exemplary graph that compares the percentage of intersections that employ sketches according to a particular aspect of the subject innovation and conventional methods of random sampling.

FIGS. 4-6 illustrate examples of sample contingency table construction, in accordance with a particular aspect of the subject innovation.

FIG. 7 illustrates an exemplary graph that shows the lack of a strong dependency of the number of documents in the sample space (D_(s)) on the number of documents that contain both W₁ and W₂ in the sample space/sketch (a_(s)).

FIG. 8 illustrates an exemplary block diagram of a system that estimates word associations in accordance with an aspect of the subject innovation.

FIG. 9 illustrates an exemplary methodology of estimating contingency tables via sketches.

FIG. 10 illustrates an exemplary environment for implementing various aspects of the subject innovation.

FIG. 11 is a schematic block diagram of an additional computing environment that can be employed to implement the subject innovation.

DETAILED DESCRIPTION

The various aspects of the subject innovation are now described with reference to the annexed drawings, wherein like numerals refer to like or corresponding elements throughout. It should be understood, however, that the drawings and detailed description relating thereto are not intended to limit the claimed subject matter to the particular form disclosed. Rather, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the claimed subject matter.

As used herein, the terms “component,” “system” and the like are intended to refer to a computer-related entity, either hardware, a combination of hardware and software, software, or software in execution. For example, a component may be, but is not limited to being, a process running on a processor, a processor, an object, an executable, a thread of execution, a program, and/or a computer. By way of illustration, both an application running on a computer and the computer can be a component. One or more components may reside within a process and/or thread of execution, and a component may be localized on one computer and/or distributed between two or more computers. The word “exemplary” is used herein to mean serving as an example, instance, or illustration. Any aspect or design described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other aspects or designs.

Furthermore, the disclosed subject matter may be implemented as a system, method, apparatus, or article of manufacture using standard programming and/or engineering techniques to produce software, firmware, hardware, or any combination thereof to control a computer or processor based device to implement aspects detailed herein. The term computer program as used herein is intended to encompass a computer program accessible from any computer-readable device, carrier, or media. For example, computer readable media can include but are not limited to magnetic storage devices (e.g., hard disk, floppy disk, magnetic strips . . . ), optical disks (e.g., compact disk (CD), digital versatile disk (DVD) . . . ), smart cards, and flash memory devices (e.g., card, stick). Additionally, it should be appreciated that a carrier wave can be employed to carry computer-readable electronic data such as those used in transmitting and receiving electronic mail or in accessing a network such as the Internet or a local area network (LAN). Of course, those skilled in the art will recognize many modifications can be made to this configuration without departing from the scope or spirit of the claimed subject matter.

Turning initially to FIG. 1, a system 100 is illustrated that enhances estimates of association for a word pair 105, via employing a sampling component 110 with non-random capabilities to construct sample contingency table(s) 130. The sampling component 110 can sample the data according to a predetermined rate, to mitigate a requirement to examine an entire document (or list of documents) for determining whether two words are associated. As such, the space 120 can be sampled to form a sample space 125. The sampling component 110 can include non-random features and/or a varying sampling rate, to provide for a flexible sampling rate depending on search criteria (e.g., rareness or prominence of word usage).

Accordingly, a more general sampling method (e.g., with non-random sampling as opposed to conventional random sampling methods) can be provided, wherein the estimations associated therewith are based on likelihood. Thus, a typical requirement to examine an entire document (or list of documents) to determine whether two words are associated can be mitigated. The sample contingency table 130 in a sketch space is initially computed. Such a contingency table 130 can include a plurality of cells that form a matrix of integers, which designate occurrences (or absence) of words within documents. Subsequently, a maximum likelihood argument can be employed to locate the most likely contingency table 133 in the original space, while at the same time considering estimate associations and the already imposed constraints (e.g., presence or absence of words in a document, number of occurrences, frequency of occurrences, and the like). Therefore, associations between data can be readily determined via the sampling component.

Referring now to FIG. 2, two-way associations represented as two-way contingency tables are illustrated. The subject innovation can construct the sample contingency table 220 and estimate the contingency table 240 from table 220, via an estimation component 260. The estimation component 260, as described in detail infra, can employ a likelihood argument (e.g., partial likelihood, maximum likelihood, and the like). By enhancing the sampling procedure, the subject innovation can facilitate obtaining a better resolution, even though a more complex estimation can be required (e.g., a likelihood argument).

In one exemplary aspect, a standard inverted index can be considered. For example, for the word W₁, a set of postings P₁ can exist that contains a set of document IDs, one for each document containing W₁. Moreover, the size of the posting f₁=|P₁| can correspond to the margins of the contingency table 240, also known as document frequencies (df) in information retrieval (IR). Such posting lists can be estimated by sketches K. Assuming that document IDs are random (e.g., obtained by random permutations), K₁ can then be computed as a random sample of P₁, by selecting the first few elements of P₁.
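By way of illustration only, the following Python fragment builds such a sketch from a postings list; the posting contents are hypothetical, and the document IDs are assumed to have already been randomly permuted.

```python
# A sketch K is simply the k smallest document IDs of a postings list P.
# Because the IDs are assumed randomly permuted, the k smallest IDs form
# a random sample of P. The postings below are illustrative.
def build_sketch(postings, k):
    return sorted(postings)[:k]

P1 = [3, 4, 7, 9, 10, 15, 18, 19, 24, 28, 33]   # hypothetical postings for W1
K1 = build_sketch(P1, k=7)
print(K1)   # -> [3, 4, 7, 9, 10, 15, 18]
```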

As depicted in FIG. 2, considering the words W₁ and W₂, a is the number of documents that contain both W₁ and W₂. Likewise, b is the number of documents that contain W₁ but not W₂, and c is the number of documents that contain W₂ but not W₁. Also, d is the number of documents that contain neither W₁ nor W₂. As will be described in detail infra, a likelihood function associated with the estimation component 260 can leverage constraints such as: the size of the collection D (e.g., the total number of documents in the collection, a+b+c+d), the margin (length of posting lists) f₁=a+b, and the margin f₂=a+c. Thus, the sample contingency table 220 can include a plurality of cells that form a matrix of integers, which designate occurrences (or absence) of words within documents. Subsequently, a maximum likelihood argument, as described in detail infra, can be employed to locate the most likely contingency table in the original space, while at the same time considering estimate associations and the other already imposed constraints (e.g., presence or absence of words in a document, number of occurrences, frequency of occurrences, and the like). Therefore, an entire document list need not be examined, and associations between data can be readily determined via the sampling component, which can employ non-random methodologies. To facilitate description and for further appreciation of the subject innovation, a conventional sketch algorithm known in the art as the “Broder” method is initially described below, and the enhancements upon such algorithm by the subject innovation in conjunction with contingency tables are subsequently described.

In general, sketches were conventionally designed to find duplicate pages for a web crawling application. The similarity of two web pages is typically defined in terms of resemblance (R). In the Broder method, it can be assumed that each document in the corpus of size D is assigned a unique ID between 1 and D. P₁, the postings for word W₁, is a sorted list of f₁ document IDs. Similarly, P₂ denotes the postings for word W₂. Initially, a random permutation can be performed on the document IDs, and the smallest IDs in the postings P₁ and P₂ (denoted as MIN(P₁) and MIN(P₂), respectively) recorded. As such, the probability of MIN(P₁)=MIN(P₂) is the chance of drawing an ID from P₁∩P₂ out of P₁∪P₂, e.g.,

$\begin{matrix}{{P\left( {{{MIN}\left( P_{1} \right)} = {{MIN}\left( P_{2} \right)}} \right)} = {\frac{\left| {P_{1}\bigcap P_{2}} \right|}{\left| {P_{1}\bigcup P_{2}} \right|} = {R\left( {W_{1},W_{2}} \right)}},} & \left( {{eq}.\mspace{14mu} 1} \right)\end{matrix}$

Therefore, the resemblance of W₁ and W₂, e.g.,

${{R\left( {W_{1},W_{2}} \right)} = \frac{\left| {P_{1}\bigcap P_{2}} \right|}{\left| {P_{1}\bigcup P_{2}} \right|}},$ can be estimated in an unbiased manner by repeating the permutation k times independently, in a straightforward manner:

$\begin{matrix}{{\hat{R}}_{B,r} = \frac{\#\left\{ {{{MIN}\left( P_{1} \right)} = {{MIN}\left( P_{2} \right)}} \right\}}{k}} & \left( {{eq}.\mspace{14mu} 2} \right)\end{matrix}$

Accordingly, Broder's original sketch algorithm typically employs only one permutation on the document IDs. After the permutation, the postings P₁ can be sorted ascending, and the sketch K₁ is then the first (smallest) k₁ document IDs in P₁. The Broder method employs MIN_(k)(Z) to denote the k smallest elements in the set Z. Thus, K₁=MIN_(k₁)(P₁), and K₂=MIN_(k₂)(P₂) denotes the sketch of P₂. Moreover, the Broder method restricted k₁=k₂=k, and estimated the resemblance by

$\begin{matrix}{{{\hat{R}}_{B} = \frac{\left| {{{MIN}_{k}\left( {K_{1}\bigcup K_{2}} \right)}\bigcap K_{1}\bigcap K_{2}} \right|}{\left| {{MIN}_{k}\left( {K_{1}\bigcup K_{2}} \right)} \right|}},} & \left( {{eq}.\mspace{14mu} 3} \right)\end{matrix}$ and proved $E\left( {\hat{R}}_{B} \right) = R$.
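For illustration, the estimator of eq. 3 can be coded directly as follows; the two sketches shown are hypothetical.

```python
# Broder's single-permutation resemblance estimator (eq. 3): take the k
# smallest IDs of K1 union K2 and count how many of them appear in both
# sketches. The sketches below are illustrative.
def resemblance_broder(K1, K2, k):
    min_k_union = set(sorted(set(K1) | set(K2))[:k])
    matched = min_k_union & set(K1) & set(K2)
    return len(matched) / k

K1 = [3, 4, 7, 9, 10, 15, 18]
K2 = [2, 4, 5, 8, 15, 19, 21]
print(resemblance_broder(K1, K2, k=7))   # -> 1/7 for these sketches
```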

Using the notation in FIG. 2, a=|P₁∩P₂|. One can divide the set P₁∪P₂ (of size f₁+f₂−a) into two disjoint sets: P₁∩P₂ and P₁∪P₂−P₁∩P₂, whose sizes are a and f₁+f₂−a−a=b+c, respectively. Within the set MIN_(k)(K₁∪K₂) (of size k), the document IDs that belong to P₁∩P₂ can be MIN_(k)(K₁∪K₂)∩K₁∩K₂, whose size is denoted by a_(s)^(B). As such, a hypergeometric sample can be obtained, e.g., sampling k document IDs from P₁∪P₂ randomly without replacement and obtaining a_(s)^(B) IDs that belong to P₁∩P₂. By the property of the hypergeometric distribution, the expectation of a_(s)^(B) can be

$\begin{matrix}{{E\left( a_{s}^{B} \right)} = \frac{ak}{f_{1} + f_{2} - a}\;\Rightarrow\;{E\left( \frac{a_{s}^{B}}{k} \right)} = \frac{a}{f_{1} + f_{2} - a} = \frac{\left| {P_{1}\bigcap P_{2}} \right|}{\left| {P_{1}\bigcup P_{2}} \right|}\;\Rightarrow\;{E\left( {\hat{R}}_{B} \right)} = {R.}} & \left( {{eq}.\mspace{14mu} 4} \right)\end{matrix}$

Such a sketch (the minwise sketch) can be considered a “sample-with-replacement” version of the original sketch. In ${\hat{R}}_{B,r}$, the additional subscript r indicates “sample-with-replacement.”

Since the “minwise” sketch is a binomial sample and the “original” sketch is a hypergeometric sample, the associated variances can be written as:

$\begin{matrix}{{{{Var}\mspace{11mu}\left( {\hat{R}}_{B,r} \right)} = {\frac{1}{k}{R\left( {1 - R} \right)}}},{{{Var}\mspace{11mu}\left( {\hat{R}}_{B} \right)} = {\frac{1}{k}{R\left( {1 - R} \right)}{\frac{\left( {f_{1} + f_{2} - a} \right) - k}{\left( {f_{1} + f_{2} - a} \right) - 1}.}}}} & \left( {{eq}.\mspace{14mu} 5} \right)\end{matrix}$wherein the term

$\frac{\left( {f_{1} + f_{2} - a} \right) - k}{\left( {f_{1} + f_{2} - a} \right) - 1}$is often referred to as the “finite population correction factor”.

When k is not too large, and in terms of accuracy, the difference between the two sketch constructions can be very small, given the same sketch size. Once the resemblance R is estimated, one could estimate the original contingency table in FIG. 2 from the estimated resemblance and known margins as:

$\begin{matrix}{{\hat{a}}_{B} = {\frac{{\hat{R}}_{B}}{1 + {\hat{R}}_{B}}\left( {f_{1} + f_{2}} \right)}} & \left( {{eq}.\mspace{14mu} 6} \right)\end{matrix}$

However, â_(B) does not make full use of the sample: eq. 3 indicates that only k samples are employed in the estimation, while the total number of samples is 2×k. Some of the discarded samples can include useful information. In contrast, according to an exemplary aspect of the subject innovation, all useful samples can be employed. As such, and since estimation variances (errors) are often inversely proportional to the sample size, the subject innovation can typically supply twice the accuracy of â_(B) as compared to the Broder algorithm described above. The subject innovation also provides for advantages such as additional flexibility (e.g., mitigating a requirement of k₁=k₂).
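For completeness, eq. 6 itself is a one-line computation; the resemblance estimate and margins below are hypothetical.

```python
# Estimate of the intersection cell a from a resemblance estimate and the
# known margins (eq. 6). The inputs are illustrative.
def a_hat_from_resemblance(R_hat, f1, f2):
    return R_hat / (1.0 + R_hat) * (f1 + f2)

print(a_hat_from_resemblance(R_hat=0.25, f1=1000, f2=800))   # -> 360.0
```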

FIG. 3 illustrates the effectiveness of sketches, when compared to random sampling of two postings and intersecting the samples to estimate associations. For example, a random sample of size k from P₁ (denoted as Z₁), and a random sample Z₂ of size k from P₂, can be obtained. Assuming a_(s)^(Z)=|Z₁∩Z₂|, and f₁=f₂=f for simplicity, it is apparent that

${E\left( \frac{a_{s}^{Z}}{a} \right)} = \frac{k^{2}}{f^{2}},$ which is represented by the dashed curve 320 in FIG. 3. In contrast, with sketches K₁ and K₂ (the k smallest IDs in P₁ and P₂, respectively), a_(s)=|K₁∩K₂|, one can obtain

${{E\left( \frac{a_{s}}{a} \right)} \approx \frac{k}{f}},$ (held with very good accuracy) as shown by the solid curves 310 in FIG. 3. Therefore, by comparing the percentage of intersections,

${{E\left( \frac{a_{s}^{Z}}{a} \right)}\mspace{14mu}{and}\mspace{14mu}{E\left( \frac{a_{s}}{a} \right)}},$ the sketch can supply a significant improvement over sampling over postings. Moreover, the difference between

$\frac{k}{f}\mspace{14mu}{and}\mspace{14mu}\frac{k^{2}}{f^{2}}$becomes particularly important at low sampling rates.

As explained earlier, FIG. 3 illustrates an exemplary graph that compares the percentage of intersections that employ sketches with that of methods of random sampling. The line labeled 310 indicates a linear sampling rate (e.g., sketches) that dominates the dashed graph 320, which represents conventional random sampling. As illustrated, and by comparing the percentage of intersections, it is readily apparent that the sketches depicted by 310 dominate random sampling 320. As illustrated, there exists one dashed curve 320 across all values of a, yet a plurality of indistinguishable solid curves depicted by graph 310, depending on a.
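The gap between the two sampling schemes can be checked with a small simulation; the following Python sketch uses hypothetical corpus parameters and is only meant to reproduce the qualitative behavior of FIG. 3.

```python
import random

# Compare how much of the true intersection a is recovered by (i) random
# samples of size k from each posting and (ii) sketches (the k smallest
# IDs of each posting). Expected fractions: about (k/f)^2 versus k/f.
D, f, a, k, trials = 100000, 2000, 400, 200, 200
hits_random, hits_sketch = 0, 0
for _ in range(trials):
    ids = random.sample(range(1, D + 1), 2 * f - a)
    common, only1, only2 = ids[:a], ids[a:f], ids[f:2 * f - a]
    P1, P2 = sorted(common + only1), sorted(common + only2)
    Z1, Z2 = set(random.sample(P1, k)), set(random.sample(P2, k))
    hits_random += len(Z1 & Z2)                     # random posting samples
    hits_sketch += len(set(P1[:k]) & set(P2[:k]))   # sketches

print(hits_random / (trials * a))   # close to (k/f)^2 = 0.01
print(hits_sketch / (trials * a))   # close to k/f = 0.1
```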

Thus, the subject innovation enables an enhanced sampling (e.g., non-randomized sampling), in conjunction with maximum likelihood estimation, wherein a_(s)=a·s (s being a sampling rate). Hence, in contrast to conventional methods (e.g., a_(s)=a·s²), the subject innovation supplies a more dynamic range on a cell of a contingency table (e.g., a cell with the smallest counts). Moreover, by enhancing the sampling procedure, the subject innovation can facilitate obtaining a better resolution, even though a more complex estimation can be required (e.g., a likelihood argument).

FIGS. 4-6 illustrate examples of sample contingency table construction. In the corpus of FIG. 4, there are D=36 documents numbered from 1 to 36 and sorted ascending. By choosing a (corpus) sampling rate of 50%, then D_(s)=18. Since document IDs are assumed random, the first 18 documents can be picked. Assuming there is interest in word W₁ and word W₂, the documents that contain W₁ are marked with small circles, and documents that contain W₂ are marked with small squares, as depicted in FIG. 4. Subsequently, a sample contingency table for word W₁ and word W₂ can be constructed as: a_(s)=|{4, 15}|=2, b_(s)=|{3, 7, 9, 10, 18}|=5, c_(s)=|{2, 5, 8}|=3, d_(s)=|{1, 6, 11, 12, 13, 14, 16, 17}|=8.
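A minimal Python rendering of this construction is given below; the postings are reconstructed from the FIG. 4 description (only IDs up to D_(s)=18 matter for this sample), so they are illustrative rather than exhaustive.

```python
# Conventional (corpus-rate) sample contingency table: keep only the first
# D_s document IDs and count the four cells. Postings reconstructed from
# the FIG. 4 description; IDs above D_s are omitted since they are unused.
def sample_table(P1, P2, D_s):
    S1 = {i for i in P1 if i <= D_s}
    S2 = {i for i in P2 if i <= D_s}
    a_s = len(S1 & S2)
    b_s = len(S1 - S2)
    c_s = len(S2 - S1)
    d_s = D_s - a_s - b_s - c_s
    return a_s, b_s, c_s, d_s

P1 = [3, 4, 7, 9, 10, 15, 18]   # documents <= 18 that contain W1
P2 = [2, 4, 5, 8, 15]           # documents <= 18 that contain W2
print(sample_table(P1, P2, D_s=18))   # -> (2, 5, 3, 8)
```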

FIG. 5 illustrates a procedure that employs sketches to construct the same sample contingency table as conventional sampling, using the same example as in FIG. 4. In this procedure, samples are supplied from the beginning of the postings P₁ and P₂. In order to equivalently sample the first D_(s)=18 documents, all IDs in both sketches that are smaller than or equal to 18 are sampled. After obtaining such samples, a_(s), b_(s), c_(s) and d_(s) can be computed to construct the sample contingency table, which is identical to the example in FIG. 4. Put differently, as the document IDs in the postings are sorted ascending, one only needs to sample from the beginnings of P₁ and P₂ for IDs≦D_(s)=18, as illustrated in the boxed area 511. The sampling procedure produces a sample contingency table: a_(s)=2, b_(s)=5, c_(s)=3 and d_(s)=8, identical to the example in FIG. 4.

Such a procedure takes advantage of the fact that the document IDs span the integers from 1 to D with no gaps. When the two sketches that include all document IDs smaller than or equal to D_(s) are compared, one has effectively looked at D_(s) documents in the original collection.

When sketches are constructed off-line for all words in a corpus, it is possible that D_(s) is not known in advance. Moreover, it can be desirable to effectively vary D_(s) for different word pairs. For on-line sketch construction, it is also much easier to sample according to the postings sampling rate

$\left( \frac{k}{f} \right)$as opposed to the corpus sampling rate

$\left( \frac{D_{s}}{D} \right),$ because during sampling one does not want to compare samplings against D_(s).

Accordingly, a different sketch construction that does not require knowing D_(s) in advance is illustrated in FIG. 6. In such a procedure, sketches are built according to the postings sampling rate, or equivalently, the pre-specified sketch sizes (k₁, k₂). The last elements in K₁ and K₂ are respectively denoted as K_(1(k₁)) and K_(2(k₂)), using the standard “order statistics” notation (e.g., K_((j)) is the jth smallest element in K). One can treat D_(s)=min(K_(1(k₁)), K_(2(k₂))) and trim all document IDs in K₁ and K₂ that are larger than D_(s), wherein:

$\begin{matrix}{{D_{s} = {\min\left\{ {K_{1{(k_{1})}},K_{2{(k_{2})}}} \right\}}},\quad{k_{1}^{\prime} = {k_{1} - \left| \left\{ {j\text{:}\; K_{1{(j)}} > D_{s}} \right\} \right|}},\quad{k_{2}^{\prime} = {k_{2} - \left| \left\{ {j\text{:}\; K_{2{(j)}} > D_{s}} \right\} \right|}},\quad{a_{s} = \left| {K_{1}\bigcap K_{2}} \right|},\quad{b_{s} = {k_{1}^{\prime} - a_{s}}},\quad{c_{s} = {k_{2}^{\prime} - a_{s}}},\quad{d_{s} = {D_{s} - a_{s} - b_{s} - c_{s}.}}} & \left( {{eq}.\mspace{14mu} 7} \right)\end{matrix}$

Put differently, by employing the same corpus as in FIGS. 4 and 5, a procedure to construct sample contingency tables from sketches K₁ and K₂ (box 611) can be illustrated. K₁ consists of the first k₁=7 document IDs in P₁, and K₂ consists of the first k₂=7 IDs in P₂. There are 11 IDs in each of P₁ and P₂, and a=5 IDs in the intersection: {4, 15, 19, 24, 28}. In addition, D_(s)=min(18, 21)=18, and IDs 19 and 21 in K₂ are excluded from the sample because it cannot be determined whether they are in the intersection or not without looking outside the box 611. As it turns out, 19 is in the intersection and 21 is not. This procedure generates a sample contingency table: a_(s)=2, b_(s)=5, c_(s)=3 and d_(s)=8, the same as in FIGS. 4 and 5.
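The following Python sketch implements eq. 7 on the running example; the sketch contents are reconstructed from the FIG. 6 description and are illustrative.

```python
# Procedure 2 (eq. 7): build the sample contingency table directly from two
# sketches of pre-specified sizes, without choosing D_s in advance.
def table_from_sketches(K1, K2):
    K1, K2 = sorted(K1), sorted(K2)
    D_s = min(K1[-1], K2[-1])             # D_s = min(K1(k1), K2(k2))
    k1p = sum(1 for x in K1 if x <= D_s)  # trim IDs larger than D_s
    k2p = sum(1 for x in K2 if x <= D_s)
    a_s = len(set(K1) & set(K2))          # common IDs are always <= D_s
    b_s = k1p - a_s
    c_s = k2p - a_s
    d_s = D_s - a_s - b_s - c_s
    return D_s, a_s, b_s, c_s, d_s

K1 = [3, 4, 7, 9, 10, 15, 18]       # first k1 = 7 document IDs in P1
K2 = [2, 4, 5, 8, 15, 19, 21]       # first k2 = 7 document IDs in P2
print(table_from_sketches(K1, K2))  # -> (18, 2, 5, 3, 8), as in FIGS. 4-5
```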

Although both Procedure 1 (in FIG. 5) and Procedure 2 (in FIG. 6) produce the same sample contingency tables as conventional random sampling, they differ in that Procedure 1 requires a pre-specified corpus sampling size D_(s) while Procedure 2 is more flexible. However, conditional on D_(s), Procedure 2 is the same as Procedure 1. In one exemplary aspect, and to simplify the analysis, the estimation method of the subject innovation can be based on conditioning on D_(s). After constructing the sample contingency tables, the maximum likelihood estimator (MLE) can estimate the most probable a by solving a cubic MLE equation of:

$\begin{matrix}{{\frac{f_{1} - a + 1 - b_{s}}{f_{1} - a + 1}\frac{f_{2} - a + 1 - c_{s}}{f_{2} - a + 1}\frac{D - f_{1} - f_{2} + a}{D - f_{1} - f_{2} + a - d_{s}}\frac{a}{a - a_{s}}} = 1.} & \left( {{eq}.\mspace{14mu} 8} \right)\end{matrix}$

Assuming “sample-with-replacement,” one can have a slightly simpler cubic MLE equation as indicated by:

$\begin{matrix}{{\frac{a_{s}}{a} - \frac{b_{s}}{f_{1} - a} - \frac{c_{s}}{f_{2} - a} + \frac{d_{s}}{D - f_{1} - f_{2} + a}} = {0},} & \left( {{eq}.\mspace{14mu} 9} \right)\end{matrix}$

Instead of solving a cubic equation, one can also use an accurate closed-form approximation of:

$\begin{matrix}{\hat{a} = \frac{\begin{matrix}{{f_{1}\left( {{2a_{s}} + c_{s}} \right)} + {f_{2}\left( {{2a_{s}} + b_{s}} \right)} -} \\\sqrt{\left( {{f_{1}\left( {{2a_{s}} + c_{s}} \right)} - {f_{2}\left( {{2a_{s}} + b_{s}} \right)}} \right)^{2} + {4f_{1}f_{2}b_{s}c_{s}}}\end{matrix}}{2\left( {{2a_{s}} + b_{s} + c_{s}} \right)}} & \left( {{eq}.\mspace{14mu} 10} \right)\end{matrix}$

As will be described in detail infra, eq. 10 will be derived in conjunction with an analysis of the estimation errors (which are directly related to the variance of the estimator), together with the following variance formulas (conditional on D_(s)):

$\begin{matrix}{{{Var}\mspace{11mu}\left( \hat{a} \right)} \approx {\frac{\frac{D}{D_{s}} - 1}{\frac{1}{a} + \frac{1}{f_{1} - a} + \frac{1}{f_{2} - a} + \frac{1}{D - f_{1} - f_{2} + a}}.}} & \left( {{eq}.\mspace{14mu} 11} \right)\end{matrix}$

And an approximate unconditional variance, useful for choosing sketch sizes:

$\begin{matrix}{{{Var}\mspace{11mu}\left( \hat{a} \right)_{uc}} \approx {\frac{{\max\left( {\frac{f_{1}}{k_{1}},\frac{f_{2}}{k_{2}}} \right)} - 1}{\frac{1}{a} + \frac{1}{f_{1} - a} + \frac{1}{f_{2} - a} + \frac{1}{D - f_{1} - f_{2} + a}}.}} & \left( {{eq}.\mspace{14mu} 12} \right)\end{matrix}$
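For illustration, both variance formulas are direct to evaluate; the inputs below are hypothetical.

```python
# Approximate variances of the estimator (eqs. 11 and 12); inputs illustrative.
def var_conditional(a, f1, f2, D, D_s):
    fisher = 1/a + 1/(f1 - a) + 1/(f2 - a) + 1/(D - f1 - f2 + a)
    return (D / D_s - 1) / fisher

def var_unconditional(a, f1, f2, D, k1, k2):
    fisher = 1/a + 1/(f1 - a) + 1/(f2 - a) + 1/(D - f1 - f2 + a)
    return (max(f1 / k1, f2 / k2) - 1) / fisher

print(var_conditional(a=400, f1=2000, f2=2000, D=100000, D_s=10000))
print(var_unconditional(a=400, f1=2000, f2=2000, D=100000, k1=200, k2=200))
```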

Based on statistical large sample theory, such variance formulas are accurate when the sketch sizes are reasonable (e.g., ≧20-50).

The Proposed MLE of the Subject Innovation

In one exemplary aspect, the subject innovation estimates the contingency table from the samples, the margins, and D. The most probable a, which maximizes the (full) likelihood (probability mass function, PMF) P(a_(s), b_(s), c_(s), d_(s); a), is desired. Even though the exact expression for P(a_(s), b_(s), c_(s), d_(s); a) is not known, the conditional (partial) probability P(a_(s), b_(s), c_(s), d_(s)|D_(s); a) is known. Such is the PMF of a two-way sample contingency table, based on the sketch construction Procedure 2 in FIG. 6. Therefore, the full likelihood can be factored into: P(a_(s), b_(s), c_(s), d_(s); a)=P(a_(s), b_(s), c_(s), d_(s)|D_(s); a)×P(D_(s); a).  (eq. 13)

Since a strong dependency of D_(s) on a (as illustrated in FIG. 7) is not present, a partial likelihood can also be employed, which seeks the a that maximizes the partial likelihood P(a_(s), b_(s), c_(s), d_(s)|D_(s); a) instead of the full probability. Such a partial likelihood method is widely used in statistics (e.g., the Cox proportional hazards model in survival analysis, and the like).

Conditional on D_(s), the partial likelihood can be represented by:

$\begin{matrix}\begin{matrix}{{P\left( {a_{s},b_{s},c_{s},{\left. d_{s} \middle| D_{s} \right.;a}} \right)} = \frac{\begin{pmatrix}a \\a_{s}\end{pmatrix}\begin{pmatrix}b \\b_{s}\end{pmatrix}\begin{pmatrix}c \\c_{s}\end{pmatrix}\begin{pmatrix}d \\d_{s}\end{pmatrix}}{\begin{pmatrix}{a + b + c + d} \\{a_{s} + b_{s} + c_{s} + d_{s}}\end{pmatrix}}} \\{= {\frac{\begin{matrix}{\begin{pmatrix}a \\a_{s}\end{pmatrix}\begin{pmatrix}{f_{1} - a} \\b_{s}\end{pmatrix}\begin{pmatrix}{f_{2} - a} \\c_{s}\end{pmatrix}} \\\begin{pmatrix}{D - f_{1} - f_{2} + a} \\d_{s}\end{pmatrix}\end{matrix}}{\begin{pmatrix}D \\D_{s}\end{pmatrix}} \propto}} \\{\frac{a!}{\left( {a - a_{s}} \right)!} \times \frac{\left( {f_{1} - a} \right)!}{\left( {f_{1} - a - b_{s}} \right)!} \times} \\{\frac{\left( {f_{2} - a} \right)!}{\left( {f_{2} - a - c_{s}} \right)!} \times \frac{\left( {D - f_{1} - f_{2} + a} \right)!}{\left( {D - f_{1} - f_{2} + a - d_{s}} \right)!}} \\{= {\prod\limits_{i = 0}^{a_{s} - 1}{\left( {a - i} \right) \times {\prod\limits_{i = 0}^{b_{s} - 1}{\left( {f_{1} - a - i} \right) \times}}}}} \\{\prod\limits_{i = 0}^{c_{s} - 1}{\left( {f_{2} - a - i} \right) \times}} \\{{\prod\limits_{i = 0}^{d_{s} - 1}\left( {D - f_{1} - f_{2} + a - i} \right)},}\end{matrix} & \left( {{eq}.\mspace{14mu} 14} \right)\end{matrix}$ wherein the multiplicative terms not mentioning a are discarded, as they typically will not contribute to the MLE.

Let â_(MLE) be the value of a that maximizes the partial likelihood (eq. 14) (or equivalently, maximizes the log likelihood); then log P(a_(s), b_(s), c_(s), d_(s)|D_(s); a) can be described as:

${{\sum\limits_{i = 0}^{a_{s} - 1}{\log\left( {a - i} \right)}} + {\sum\limits_{i = 0}^{b_{s} - 1}{\log\left( {f_{1} - a - i} \right)}} + {\sum\limits_{i = 0}^{c_{s} - 1}{\log\left( {f_{2} - a - i} \right)}} + {\sum\limits_{i = 0}^{d_{s} - 1}{\log\left( {D - f_{1} - f_{2} + a - i} \right)}}},$whose first derivative,

$\frac{{\partial\log}\;{P\left( {a_{s},b_{s},c_{s},{\left. d_{s} \middle| D_{s} \right.;a}} \right)}}{\partial a},$ is

$\begin{matrix}{{\sum\limits_{i = 0}^{a_{s} - 1}\frac{1}{a - i}} - {\sum\limits_{i = 0}^{b_{s} - 1}\frac{1}{f_{1} - a - i}} - {\sum\limits_{i = 0}^{c_{s} - 1}\frac{1}{f_{2} - a - i}} + {\sum\limits_{i = 0}^{d_{s} - 1}{\frac{1}{D - f_{1} - f_{2} + a - i}.}}} & \left( {{eq}.\mspace{14mu} 15} \right)\end{matrix}$

Since the second derivative,

$\frac{{\partial^{2}\log}\;{P\left( {a_{s},b_{s},c_{s},{\left. d_{s} \middle| D_{s} \right.;a}} \right)}}{\partial a^{2}}$ is

${- {\sum\limits_{i = 0}^{a_{s} - 1}\frac{1}{\left( {a - i} \right)^{2}}} - {\sum\limits_{i = 0}^{b_{s} - 1}\frac{1}{\left( {f_{1} - a - i} \right)^{2}}} - {\sum\limits_{i = 0}^{c_{s} - 1}\frac{1}{\left( {f_{2} - a - i} \right)^{2}}} - {\sum\limits_{i = 0}^{d_{s} - 1}\frac{1}{\left( {D - f_{1} - f_{2} + a - i} \right)^{2}}}},$ which is negative, the log likelihood function is concave, and therefore there is a unique maximum. One could solve (eq. 15) for

$\frac{{\partial\log}\mspace{11mu}{P\left( {a_{s},b_{s},c_{s},{\left. d_{s} \middle| D_{s} \right.;a}} \right)}}{\partial a} = 0$ numerically, yet there exists an exact solution using the updated formula from (eq. 14), wherein:

$\begin{matrix}\begin{matrix}{{P\left( {a_{s},b_{s},c_{s},{\left. d_{s} \middle| D_{s} \right.;a}} \right)} = {P\left( {a_{s},b_{s},c_{s},{\left. d_{s} \middle| D_{s} \right.;{a - 1}}} \right)} \times} \\{\frac{a}{a - a_{s}}\frac{f_{1} - a + 1 - b_{s}}{f_{1} - a + 1}\frac{f_{2} - a + 1 - c_{s}}{f_{2} - a + 1}} \\{\frac{D - f_{1} - f_{2} + a}{D - f_{1} - f_{2} + a - d_{s}}} \\{= {P\left( {a_{s},b_{s},c_{s},{\left. d_{s} \middle| D_{s} \right.;{a - 1}}} \right)} \times {{g(a)}.}}\end{matrix} & \left( {{eq}.\mspace{14mu} 16} \right)\end{matrix}$

Since it is known that the MLE exists and is unique, it suffices to find the a from g(a)=1,

$\begin{matrix}{{{g(a)} = {{\frac{a}{a - a_{s}}\frac{f_{1} - a + 1 - b_{s}}{f_{1} - a + 1}\frac{f_{2} - a + 1 - c_{s}}{f_{2} - a + 1}\frac{D - f_{1} - f_{2} + a}{D - f_{1} - f_{2} + a - d_{s}}} = 1}},} & \left( {{eq}.\mspace{14mu} 17} \right)\end{matrix}$ which is cubic in a (the fourth term vanishes) and can be solved by the Cardano formula. It is to be appreciated that numerical methods can also be employed. g(a)=1 is equivalent to q(a)=log g(a)=0. The first derivative of q(a) is

$\begin{matrix}{{q^{\prime}(a)} = {\left( {\frac{1}{f_{1} - a + 1} - \frac{1}{f_{1} - a + 1 - b_{s}}} \right) + \left( {\frac{1}{f_{2} - a + 1} - \frac{1}{f_{2} - a + 1 - c_{s}}} \right) + \left( {\frac{1}{D - f_{1} - f_{2} + a} - \frac{1}{D - f_{1} - f_{2} + a - d_{s}}} \right) + {\left( {\frac{1}{a} - \frac{1}{a - a_{s}}} \right).}}} & \left( {{eq}.\mspace{14mu} 18} \right)\end{matrix}$

One can solve for q(a)=0 iteratively using Newton's method,

$\begin{matrix}{a^{({new})} = {a^{({old})} - {\frac{q\left( a^{({old})} \right)}{q^{\prime}\left( a^{({old})} \right)}.}}} & \left( {{eq}.\mspace{14mu} 19} \right)\end{matrix}$
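As an illustration of eqs. 17-19, the Newton iteration below solves q(a)=0 for the running example of FIGS. 4-6; the starting guess (the sample intersection cell scaled by D/D_(s)) and the margins taken from the FIG. 6 description are convenience assumptions for this sketch, not part of the described method.

```python
import math

# Newton's method (eq. 19) applied to q(a) = log g(a) (eqs. 17-18).
def mle_newton(a_s, b_s, c_s, d_s, f1, f2, D, D_s, iters=100, tol=1e-9):
    a = max(a_s + 1.0, a_s * D / D_s)      # illustrative starting guess
    for _ in range(iters):
        q = (math.log(a) - math.log(a - a_s)
             + math.log(f1 - a + 1 - b_s) - math.log(f1 - a + 1)
             + math.log(f2 - a + 1 - c_s) - math.log(f2 - a + 1)
             + math.log(D - f1 - f2 + a) - math.log(D - f1 - f2 + a - d_s))
        dq = ((1 / a - 1 / (a - a_s))
              + (1 / (f1 - a + 1) - 1 / (f1 - a + 1 - b_s))
              + (1 / (f2 - a + 1) - 1 / (f2 - a + 1 - c_s))
              + (1 / (D - f1 - f2 + a) - 1 / (D - f1 - f2 + a - d_s)))
        step = q / dq
        a -= step
        if abs(step) < tol:
            break
    return a

# FIGS. 4-6 sample: a_s=2, b_s=5, c_s=3, d_s=8, f1=f2=11, D=36, D_s=18.
print(mle_newton(2, 5, 3, 8, 11, 11, 36, 18))   # approximately 3.7
```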

As explained above, FIG. 7 illustrates that the multivariate hypergeometric sample expectation E(D_(s)) is not sensitive to a (D=2×10⁷, f₁=D/20, f₂=f₁/2). The different curves correspond to a=0, 0.05, 0.2, 0.5 and 0.9 f₂. Such curves 710 are almost indistinguishable except at very low sampling rates. Moreover, it is to be appreciated that at a sampling rate of 10⁻⁵, the sample size is k₂=5.

Likewise, under the “sample-with-replacement” assumption, the likelihood function is slightly simpler as:

$\begin{matrix}{{P\left( {a_{s},b_{s},c_{s},{\left. d_{s} \middle| D_{s} \right.;a},r} \right)} = {\begin{pmatrix}D_{s} \\{a_{s},b_{s},c_{s},d_{s}}\end{pmatrix}\left( \frac{a}{D} \right)^{a_{s}}\left( \frac{b}{D} \right)^{b_{s}}\left( \frac{c}{D} \right)^{c_{s}}\left( \frac{d}{D} \right)^{d_{s}}} \propto {a^{a_{s}}\left( {f_{1} - a} \right)^{b_{s}}\left( {f_{2} - a} \right)^{c_{s}}{\left( {D - f_{1} - f_{2} + a} \right)^{d_{s}}.}}} & \left( {{eq}.\mspace{14mu} 20} \right)\end{matrix}$

Setting the first derivative of the log likelihood to be zero yields a cubic equation:

$\begin{matrix}{{\frac{a_{s\;}}{a} - \frac{b_{s}}{f_{1} - a} - \frac{c_{s}}{f_{2} - a} + \frac{d_{s}}{D - f_{1} - f_{2} + a}} = 0.} & \left( {{eq}.\mspace{14mu} 21} \right)\end{matrix}$

In a related aspect of the subject innovation, a less accurate margin-free baseline can be provided instead of solving a cubic equation for the exact MLE. Accordingly, a convenient closed-form approximation to the exact MLE is described below.

The “sample-with-replacement” can be assumed, and a_(s) can be identified from K₁ without the knowledge of K₂. Put differently, it can be assumed that:

${{\left. a_{s}^{(1)} \right.\sim{Binomial}}\mspace{11mu}\left( {{a_{s} + b_{s}},\frac{a}{f_{1}}} \right)},{{\left. a_{s}^{(2)} \right.\sim{Binomial}}\mspace{11mu}\left( {{a_{s} + c_{s}},\frac{a}{f_{2}}} \right)},{a_{s}^{(1)}\mspace{14mu}{and}\mspace{14mu} a_{s}^{(2)}}$ are independent, with a_(s)⁽¹⁾=a_(s)⁽²⁾=a_(s). The PMF of (a_(s)⁽¹⁾, a_(s)⁽²⁾) is a product of two binomials:

$\begin{matrix}{{\left\lbrack {\begin{pmatrix}f_{1} \\{a_{s} + b_{s}}\end{pmatrix}\left( \frac{a}{f_{1}} \right)^{a_{s}}\left( \frac{f_{1} - a}{f_{1}} \right)^{b_{s}}} \right\rbrack \times \left\lbrack {\begin{pmatrix}f_{2} \\{a_{s} + c_{s}}\end{pmatrix}\left( \frac{a}{f_{2}} \right)^{a_{s}}\left( \frac{f_{2} - a}{f_{2}} \right)^{c_{s}}} \right\rbrack} \propto {{a^{2a_{s}}\left( {f_{1} - a} \right)}^{b_{s}}{\left( {f_{2} - a} \right)^{c_{s}}.}}} & \left( {{eq}.\mspace{14mu} 22} \right)\end{matrix}$

Setting the first derivative of the logarithm of (eq. 22) to be zero, the following can be obtained

$\begin{matrix}{{{\frac{2a_{s}}{a} - \frac{b_{s}}{f_{1} - a} - \frac{c_{s}}{f_{2} - a}} = 0},} & \left( {{eq}.\mspace{14mu} 23} \right)\end{matrix}$which is quadratic in a and has a convenient closed-form solution:

$\begin{matrix}{{\hat{a}}_{{MLE},a} = {\frac{\begin{matrix}{{f_{\; 1}\left( {{2a_{\; s}} + c_{\; s}} \right)} + {f_{\; 2}\left( {{2a_{\; s}} + b_{\; s}} \right)} -} \\\sqrt{\left( \;{{f_{\; 1}\left( {{2\; a_{\; s}}\; + \; c_{\; s}} \right)} - {f_{2}\left( {{2a_{s}} + b_{s}} \right)}} \right)^{2} + {4f_{1}f_{2}b_{s}c_{s}}}\end{matrix}}{2\left( {{2a_{s}} + b_{s} + c_{s}} \right)}.}} & \left( {{eq}.\mspace{14mu} 24} \right)\end{matrix}$

The second root can be ignored because it is always out of range:

$\frac{f_{1}\left( {2a_{s} + c_{s}} \right) + f_{2}\left( {2a_{s} + b_{s}} \right) + \sqrt{\left( {f_{1}\left( {2a_{s} + c_{s}} \right) - f_{2}\left( {2a_{s} + b_{s}} \right)} \right)^{2} + 4f_{1}f_{2}b_{s}c_{s}}}{2\left( {2a_{s} + b_{s} + c_{s}} \right)} \geq \frac{f_{1}\left( {2a_{s} + c_{s}} \right) + f_{2}\left( {2a_{s} + b_{s}} \right) + \left| {f_{1}\left( {2a_{s} + c_{s}} \right) - f_{2}\left( {2a_{s} + b_{s}} \right)} \right|}{2\left( {2a_{s} + b_{s} + c_{s}} \right)} \geq \begin{matrix}{f_{1}\mspace{14mu}{if}\mspace{14mu}f_{1}\left( {2a_{s} + c_{s}} \right) \geq f_{2}\left( {2a_{s} + b_{s}} \right)} \\{f_{2}\mspace{14mu}{if}\mspace{14mu}f_{1}\left( {2a_{s} + c_{s}} \right) < f_{2}\left( {2a_{s} + b_{s}} \right)}\end{matrix} \geq {\min\left( {f_{1},f_{2}} \right)},$ wherein typically â_(MLE,a) is very close to â_(MLE). Thus, by enhancing the sampling procedure, the subject innovation can facilitate obtaining a better resolution, even though a more complex estimation can be required (e.g., a likelihood argument).
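For illustration, the closed-form approximation (eqs. 10 and 24) evaluates as follows on the same running sample; â_(MLE,a) comes out near the exact MLE obtained in the Newton sketch above.

```python
import math

# Closed-form approximate MLE (eqs. 10 and 24), derived under the
# "sample-with-replacement" and margin-free assumptions.
def a_hat_closed_form(a_s, b_s, c_s, f1, f2):
    t1 = f1 * (2 * a_s + c_s)
    t2 = f2 * (2 * a_s + b_s)
    disc = (t1 - t2) ** 2 + 4 * f1 * f2 * b_s * c_s
    return (t1 + t2 - math.sqrt(disc)) / (2 * (2 * a_s + b_s + c_s))

# FIGS. 4-6 sample: a_s=2, b_s=5, c_s=3, f1=f2=11.
print(a_hat_closed_form(2, 5, 3, 11, 11))   # -> 3.666..., close to the exact MLE
```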

Referring now to FIG. 8, there is illustrated an exemplary sampling component 800 that estimates contingency tables for designating data associations. The sampling component 800 can include a contingency table generator component 820, and an estimation component 830. The contingency table generator component 820 is programmed and/or configured to access stored data and construct corresponding contingency table(s) for designated word pair(s). The stored data can exist in a database having a plurality of records with associated fields populated by one or more processes or services over time, for example. Such data can be stored at one or more storage locations (local or remote) relative to the instance of the contingency table generator component 820. For the example of Web-related data, a server associated with the Web site may collect data based on forms submitted by the user, based on cookies associated with the user, and/or based on user log files. The server may, in turn, integrate the collected data with other data sources and organize such information according to a predetermined format. The estimation component 830 can then apply a probabilistic analysis to locate most likely contingency tables in the original space given the sketch and constraints explained in detail supra. As such, a typical requirement to examine an entire document (or list of documents) to determine whether two words are associated can be mitigated.

FIG. 9 illustrates an exemplary methodology 900 of studying associations in accordance with an aspect of the subject innovation. While the exemplary method is illustrated and described herein as a series of blocks representative of various events and/or acts, the subject innovation is not limited by the illustrated ordering of such blocks. For instance, some acts or events may occur in different orders and/or concurrently with other acts or events, apart from the ordering illustrated herein, in accordance with the innovation. In addition, not all illustrated blocks, events or acts may be required to implement a methodology in accordance with the subject innovation. Moreover, it will be appreciated that the exemplary method and other methods according to the innovation may be implemented in association with the method illustrated and described herein, as well as in association with other systems and apparatus not illustrated or described.

Initially, and at 910, the word pairs are sampled for estimating associations. Subsequently, and at 920, contingency tables are computed for the word pairs. A likelihood argument (e.g., maximum likelihood) can then be applied to the contingency tables at 930. The most likely contingency table can then be located in the original space at 940, given the sketch and constraints. Subsequently, and at 950, the contingency tables can be summarized and employed in desired applications (e.g., word associations).

The subject innovation (e.g., in connection with estimating associations for word pairs) can employ various artificial intelligence based schemes for carrying out various aspects thereof. For example, a process for learning explicitly or implicitly when a contingency table should be generated can be facilitated via an automatic classification system and process. As used herein, the term “inference” refers generally to the process of reasoning about or inferring states of the system, environment, and/or user from a set of observations as captured via events and/or data. Inference can be employed to identify a specific context or action, or can generate a probability distribution over states, for example. The inference can be probabilistic; that is, the computation of a probability distribution over states of interest based on a consideration of data and events. Inference can also refer to techniques employed for composing higher-level events from a set of events and/or data. Such inference results in the construction of new events or actions from a set of observed events and/or stored event data, whether or not the events are correlated in close temporal proximity, and whether the events and data come from one or several event and data sources.

Classification can employ a probabilistic and/or statistical-based analysis (e.g., factoring into the analysis utilities and costs) to prognose or infer an action that a user desires to be automatically performed. For example, a support vector machine (SVM) classifier can be employed. Other classification approaches that can be employed include Bayesian networks, decision trees, and probabilistic classification models providing different patterns of independence. Classification as used herein also is inclusive of statistical regression that is utilized to develop models of priority.

As will be readily appreciated from the subject specification, the subject innovation can employ classifiers that are explicitly trained (e.g., via a generic training data) as well as implicitly trained (e.g., via observing user behavior, receiving extrinsic information) so that the classifier is used to automatically determine according to a predetermined criteria which answer to return to a question. For example, SVM's are configured via a learning or training phase within a classifier constructor and feature selection module. A classifier is a function that maps an input attribute vector, x=(x1, x2, x3, x4, xn), to a confidence that the input belongs to a class—that is, f(x)=confidence(class).

In order to provide a context for the various aspects of the disclosed subject matter, FIGS. 10 and 11 as well as the following discussion are intended to provide a brief, general description of a suitable environment in which the various aspects of the disclosed subject matter may be implemented. While the subject matter has been described above in the general context of computer-executable instructions of a computer program that runs on a computer and/or computers, those skilled in the art will recognize that the innovation also may be implemented in combination with other program modules. Generally, program modules include routines, programs, components, data structures, etc. that perform particular tasks and/or implement particular abstract data types. Moreover, those skilled in the art will appreciate that the innovative methods can be practiced with other computer system configurations, including single-processor or multiprocessor computer systems, mini-computing devices, mainframe computers, as well as personal computers, hand-held computing devices (e.g., personal digital assistant (PDA), phone, watch . . . ), microprocessor-based or programmable consumer or industrial electronics, and the like. The illustrated aspects may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. However, some, if not all aspects of the innovation can be practiced on stand-alone computers. In a distributed computing environment, program modules may be located in both local and remote memory storage devices.

With reference to FIG. 10, an exemplary environment 1010 for implementing various aspects of the subject innovation is described that includes a computer 1012. The computer 1012 includes a processing unit 1014, a system memory 1016, and a system bus 1018. The system bus 1018 couples system components including, but not limited to, the system memory 1016 to the processing unit 1014. The processing unit 1014 can be any of various available processors. Dual microprocessors and other multiprocessor architectures also can be employed as the processing unit 1014.

The system bus 1018 can be any of several types of bus structure(s) including the memory bus or memory controller, a peripheral bus or external bus, and/or a local bus using any variety of available bus architectures including, but not limited to, 11-bit bus, Industrial Standard Architecture (ISA), Micro-Channel Architecture (MSA), Extended ISA (EISA), Intelligent Drive Electronics (IDE), VESA Local Bus (VLB), Peripheral Component Interconnect (PCI), Universal Serial Bus (USB), Advanced Graphics Port (AGP), Personal Computer Memory Card International Association bus (PCMCIA), and Small Computer Systems Interface (SCSI).

The system memory 1016 includes volatile memory 1020 and nonvolatile memory 1022. The basic input/output system (BIOS), containing the basic routines to transfer information between elements within the computer 1012, such as during start-up, is stored in nonvolatile memory 1022. By way of illustration, and not limitation, nonvolatile memory 1022 can include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable ROM (EEPROM), or flash memory. Volatile memory 1020 includes random access memory (RAM), which acts as external cache memory. By way of illustration and not limitation, RAM is available in many forms such as synchronous RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), and direct Rambus RAM (DRRAM).

Computer 1012 also includes removable/non-removable, volatile/non-volatile computer storage media. FIG. 10 illustrates, for example, a disk storage 1024. Disk storage 1024 includes, but is not limited to, devices like a magnetic disk drive, floppy disk drive, tape drive, Jaz drive, Zip drive, LS-60 drive, flash memory card, or memory stick. In addition, disk storage 1024 can include storage media separately or in combination with other storage media including, but not limited to, an optical disk drive such as a compact disk ROM device (CD-ROM), CD recordable drive (CD-R Drive), CD rewritable drive (CD-RW Drive) or a digital versatile disk ROM drive (DVD-ROM). To facilitate connection of the disk storage devices 1024 to the system bus 1018, a removable or non-removable interface is typically used, such as interface 1026.

It is to be appreciated that FIG. 10 describes software that acts as an intermediary between users and the basic computer resources described in suitable operating environment 1010. Such software includes an operating system 1028. Operating system 1028, which can be stored on disk storage 1024, acts to control and allocate resources of the computer system 1012. System applications 1030 take advantage of the management of resources by operating system 1028 through program modules 1032 and program data 1034 stored either in system memory 1016 or on disk storage 1024. It is to be appreciated that various components described herein can be implemented with various operating systems or combinations of operating systems.

A user enters commands or information into the computer 1012 through input device(s) 1036. Input devices 1036 include, but are not limited to, a pointing device such as a mouse, trackball, stylus, touch pad, keyboard, microphone, joystick, game pad, satellite dish, scanner, TV tuner card, digital camera, digital video camera, web camera, and the like. These and other input devices connect to the processing unit 1014 through the system bus 1018 via interface port(s) 1038. Interface port(s) 1038 include, for example, a serial port, a parallel port, a game port, and a universal serial bus (USB). Output device(s) 1040 use some of the same type of ports as input device(s) 1036. Thus, for example, a USB port may be used to provide input to computer 1012, and to output information from computer 1012 to an output device 1040. Output adapter 1042 is provided to illustrate that there are some output devices 1040 like monitors, speakers, and printers, among other output devices 1040, that require special adapters. The output adapters 1042 include, by way of illustration and not limitation, video and sound cards that provide a means of connection between the output device 1040 and the system bus 1018. It should be noted that other devices and/or systems of devices provide both input and output capabilities such as remote computer(s) 1044.

Computer 1012 can operate in a networked environment using logical connections to one or more remote computers, such as remote computer(s) 1044. The remote computer(s) 1044 can be a personal computer, a server, a router, a network PC, a workstation, a microprocessor based appliance, a peer device or other common network node and the like, and typically includes many or all of the elements described relative to computer 1012. For purposes of brevity, only a memory storage device 1046 is illustrated with remote computer(s) 1044. Remote computer(s) 1044 is logically connected to computer 1012 through a network interface 1048 and then physically connected via communication connection 1050. Network interface 1048 encompasses communication networks such as local-area networks (LAN) and wide-area networks (WAN). LAN technologies include Fiber Distributed Data Interface (FDDI), Copper Distributed Data Interface (CDDI), Ethernet/IEEE 802.3, Token Ring/IEEE 802.5 and the like. WAN technologies include, but are not limited to, point-to-point links, circuit switching networks like Integrated Services Digital Networks (ISDN) and variations thereon, packet switching networks, and Digital Subscriber Lines (DSL).

Communication connection(s) 1050 refers to the hardware/software employed to connect the network interface 1048 to the bus 1018. While communication connection 1050 is shown for illustrative clarity inside computer 1012, it can also be external to computer 1012. The hardware/software necessary for connection to the network interface 1048 includes, for exemplary purposes only, internal and external technologies such as modems (including regular telephone grade modems, cable modems and DSL modems), ISDN adapters, and Ethernet cards.

FIG. 11 is a schematic block diagram of a sample computing environment 1100 that can be employed to examine associations via contingency tables in accordance with an aspect of the subject innovation. The system 1100 includes one or more client(s) 1110. The client(s) 1110 can be hardware and/or software (e.g., threads, processes, computing devices). The system 1100 also includes one or more server(s) 1130. The server(s) 1130 can also be hardware and/or software (e.g., threads, processes, computing devices). The servers 1130 can house threads to perform transformations by employing the components described herein, for example. One possible communication between a client 1110 and a server 1130 may be in the form of a data packet adapted to be transmitted between two or more computer processes. The system 1100 includes a communication framework 1150 that can be employed to facilitate communications between the client(s) 1110 and the server(s) 1130. The client(s) 1110 are operably connected to one or more client data store(s) 1160 that can be employed to store information local to the client(s) 1110. Similarly, the server(s) 1130 are operably connected to one or more server data store(s) 1140 that can be employed to store information local to the servers 1130.

What has been described above includes various exemplary aspects. It is, of course, not possible to describe every conceivable combination of components or methodologies for purposes of describing these aspects, but one of ordinary skill in the art may recognize that many further combinations and permutations are possible. For example, the subject innovation can be extended beyond two-way associations to encompass multi-way associations. Accordingly, the aspects described herein are intended to embrace all such alterations, modifications and variations that fall within the spirit and scope of the appended claims. Furthermore, to the extent that the term “includes” is used in either the detailed description or the claims, such term is intended to be inclusive in a manner similar to the term “comprising” as “comprising” is interpreted when employed as a transitional word in a claim. By way of illustration only, a non-limiting code sketch of the claimed estimation procedure is provided after the claims below.

1. A computer implemented system comprising the following computer executable components: a memory; a processor that executes the following software components: a sampling component that employs sketches to facilitate generation of contingency tables and forms a sample of data; and an estimation component that employs a probabilistic argument to locate a most likely contingency table for the sample; wherein the contingency tables comprise a sample intersection cell (a_s) that is equal to an intersection cell (a) multiplied by a sampling rate (s); wherein the processor finds and outputs to a computer display estimates on word pair associations in searching document lists on a computer; and wherein the sketches employed to facilitate generation of the contingency tables comprise sketches K₁, K₂, and $D_{s} = \min\left\{ K_{1(k_{1})},\, K_{2(k_{2})} \right\},\; k_{1}^{\prime} = k_{1} - \left\{ j : K_{1(j)} > D_{s} \right\},\; k_{2}^{\prime} = k_{2} - \left\{ j : K_{2(j)} > D_{s} \right\},\; a_{s} = K_{1} \bigcap K_{2},\; b_{s} = k_{1}^{\prime} - a_{s},\; c_{s} = k_{2}^{\prime} - a_{s},\; d_{s} = D_{s} - a_{s} - b_{s} - c_{s};$ wherein the most likely contingency table is determined via defining a maximum likelihood estimation as $P\left( a_{s}, b_{s}, c_{s}, d_{s} \mid D_{s}; a, r \right) = \frac{\begin{pmatrix} a \\ a_{s} \end{pmatrix}\begin{pmatrix} b \\ b_{s} \end{pmatrix}\begin{pmatrix} c \\ c_{s} \end{pmatrix}\begin{pmatrix} d \\ d_{s} \end{pmatrix}}{\begin{pmatrix} a + b + c + d \\ a_{s} + b_{s} + c_{s} + d_{s} \end{pmatrix}} = \frac{\begin{pmatrix} a \\ a_{s} \end{pmatrix}\begin{pmatrix} f_{1} - a \\ b_{s} \end{pmatrix}\begin{pmatrix} f_{2} - a \\ c_{s} \end{pmatrix}\begin{pmatrix} D - f_{1} - f_{2} + a \\ d_{s} \end{pmatrix}}{\begin{pmatrix} D \\ D_{s} \end{pmatrix}},$ and a closed form solution is presented in the form of ${\hat{a}}_{MLE,a} = \frac{f_{1}\left( 2a_{s} + c_{s} \right) + f_{2}\left( 2a_{s} + b_{s} \right) - \sqrt{\left( f_{1}\left( 2a_{s} + c_{s} \right) - f_{2}\left( 2a_{s} + b_{s} \right) \right)^{2} + 4 f_{1} f_{2} b_{s} c_{s}}}{2\left( 2a_{s} + b_{s} + c_{s} \right)}.$
2. The computer implemented system of claim 1, the estimation component further comprising a maximum likelihood estimation.
3. The computer implemented system of claim 2, further comprising a stopping rule that is based on the likelihood estimation.
4. The computer implemented system of claim 1, further comprising an artificial intelligence component that facilitates estimation of word associations, via utilizing an automatic classification system that learns explicitly or implicitly when the contingency tables should be generated, wherein the classification system employs a probabilistic and/or statistical-based analysis to infer an action that a user desires to be automatically performed.
5. A computer implemented method comprising the following computer executable acts: sampling a collection of data to form a sample space; computing a contingency table within the sample space for features that require association estimation; employing sketches to facilitate generation of the contingency table and to form a sample of data; employing a probabilistic argument on the contingency table to determine the most likely contingency table; employing a linear relation between a sample intersection cell and a sampling rate, wherein the contingency table comprises a sample intersection cell (a_s) that is equal to an intersection cell (a) multiplied by a sampling rate (s); finding and outputting estimates on word pair associations in searching document lists on a computer; employing $D_{s} = \min\left\{ K_{1(k_{1})},\, K_{2(k_{2})} \right\},\; k_{1}^{\prime} = k_{1} - \left\{ j : K_{1(j)} > D_{s} \right\},\; k_{2}^{\prime} = k_{2} - \left\{ j : K_{2(j)} > D_{s} \right\},\; a_{s} = K_{1} \bigcap K_{2},\; b_{s} = k_{1}^{\prime} - a_{s},\; c_{s} = k_{2}^{\prime} - a_{s},\; d_{s} = D_{s} - a_{s} - b_{s} - c_{s}$ for sketches K₁, K₂; determining the most likely contingency table via defining a maximum likelihood estimation as $P\left( a_{s}, b_{s}, c_{s}, d_{s} \mid D_{s}; a, r \right) = \begin{pmatrix} D_{s} \\ a_{s}, b_{s}, c_{s}, d_{s} \end{pmatrix}\left( \frac{a}{D} \right)^{a_{s}}\left( \frac{b}{D} \right)^{b_{s}}\left( \frac{c}{D} \right)^{c_{s}}\left( \frac{d}{D} \right)^{d_{s}} \propto a^{a_{s}}\left( f_{1} - a \right)^{b_{s}}\left( f_{2} - a \right)^{c_{s}}\left( D - f_{1} - f_{2} + a \right)^{d_{s}};$ and presenting a closed form solution in the form of ${\hat{a}}_{MLE,a} = \frac{f_{1}\left( 2a_{s} + c_{s} \right) + f_{2}\left( 2a_{s} + b_{s} \right) - \sqrt{\left( f_{1}\left( 2a_{s} + c_{s} \right) - f_{2}\left( 2a_{s} + b_{s} \right) \right)^{2} + 4 f_{1} f_{2} b_{s} c_{s}}}{2\left( 2a_{s} + b_{s} + c_{s} \right)}.$
6. The method of claim 5 further comprising employing a maximum likelihood estimation.
7. The method of claim 5 further comprising employing variances based on document frequencies.
8. The method of claim 5 further comprising representing a partial likelihood as $P\left( a_{s}, b_{s}, c_{s}, d_{s} \mid D_{s}; a \right) = \frac{\begin{pmatrix} a \\ a_{s} \end{pmatrix}\begin{pmatrix} b \\ b_{s} \end{pmatrix}\begin{pmatrix} c \\ c_{s} \end{pmatrix}\begin{pmatrix} d \\ d_{s} \end{pmatrix}}{\begin{pmatrix} a + b + c + d \\ a_{s} + b_{s} + c_{s} + d_{s} \end{pmatrix}} = \frac{\begin{pmatrix} a \\ a_{s} \end{pmatrix}\begin{pmatrix} f_{1} - a \\ b_{s} \end{pmatrix}\begin{pmatrix} f_{2} - a \\ c_{s} \end{pmatrix}\begin{pmatrix} D - f_{1} - f_{2} + a \\ d_{s} \end{pmatrix}}{\begin{pmatrix} D \\ D_{s} \end{pmatrix}}.$
9. The method of claim 8 further comprising presenting a solution as: $g(a) = \frac{a}{a - a_{s}} \cdot \frac{f_{1} - a + 1 - b_{s}}{f_{1} - a + 1} \cdot \frac{f_{2} - a + 1 - c_{s}}{f_{2} - a + 1} \cdot \frac{D - f_{1} - f_{2} + a}{D - f_{1} - f_{2} + a - d_{s}} = 1.$
10. A computer implemented system comprising the following computer executable components: means for sampling a collection of documents to form a sample space; means for computing a contingency table within the sample space; means for employing sketches to facilitate generation of the contingency table and to form a sample of data; means for employing a probabilistic argument on the contingency table to determine the most likely contingency table; means for employing a linear relation between a sample intersection cell and a sampling rate, wherein the contingency table comprises a sample intersection cell (a_s) that is equal to an intersection cell (a) multiplied by a sampling rate (s); means for finding and outputting estimates on word pair associations by using the sample intersection cell and the sampling rate; means for employing $D_{s} = \min\left\{ K_{1(k_{1})},\, K_{2(k_{2})} \right\},\; k_{1}^{\prime} = k_{1} - \left\{ j : K_{1(j)} > D_{s} \right\},\; k_{2}^{\prime} = k_{2} - \left\{ j : K_{2(j)} > D_{s} \right\},\; a_{s} = K_{1} \bigcap K_{2},\; b_{s} = k_{1}^{\prime} - a_{s},\; c_{s} = k_{2}^{\prime} - a_{s},\; d_{s} = D_{s} - a_{s} - b_{s} - c_{s}$ for sketches K₁, K₂; means for determining the most likely contingency table via defining a maximum likelihood estimation as $P\left( a_{s}, b_{s}, c_{s}, d_{s} \mid D_{s}; a, r \right) = \begin{pmatrix} D_{s} \\ a_{s}, b_{s}, c_{s}, d_{s} \end{pmatrix}\left( \frac{a}{D} \right)^{a_{s}}\left( \frac{b}{D} \right)^{b_{s}}\left( \frac{c}{D} \right)^{c_{s}}\left( \frac{d}{D} \right)^{d_{s}} \propto a^{a_{s}}\left( f_{1} - a \right)^{b_{s}}\left( f_{2} - a \right)^{c_{s}}\left( D - f_{1} - f_{2} + a \right)^{d_{s}};$ and means for presenting a closed form solution in the form of ${\hat{a}}_{MLE,a} = \frac{f_{1}\left( 2a_{s} + c_{s} \right) + f_{2}\left( 2a_{s} + b_{s} \right) - \sqrt{\left( f_{1}\left( 2a_{s} + c_{s} \right) - f_{2}\left( 2a_{s} + b_{s} \right) \right)^{2} + 4 f_{1} f_{2} b_{s} c_{s}}}{2\left( 2a_{s} + b_{s} + c_{s} \right)}.$
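The following Python listing is an illustrative, non-limiting sketch of how the claimed estimation procedure could be realized in software: it builds minwise-style sketches K₁, K₂ from two postings (document ID) lists, forms the sample contingency table cells (D_s, a_s, b_s, c_s, d_s) as recited in claims 1, 5 and 10, computes the closed-form estimate â_{MLE,a}, and solves g(a) = 1 of claim 9 by a simple search. The function names, the synthetic example, the assumption that document IDs are already randomly permuted, and the assumption that the partial likelihood is unimodal in a are illustrative choices, not part of the claimed subject matter.

import random

def min_sketch(postings, k):
    """Sketch K: the k smallest (randomly permuted) document IDs in a postings list."""
    return sorted(postings)[:k]

def sample_contingency_table(K1, K2):
    """Form (D_s, a_s, b_s, c_s, d_s) from two sketches, per claims 1, 5 and 10."""
    Ds = min(K1[-1], K2[-1])              # D_s = min{K1(k1), K2(k2)}
    K1p = [x for x in K1 if x <= Ds]      # keep sketch entries not exceeding D_s (k1')
    K2p = [x for x in K2 if x <= Ds]      # (k2')
    a_s = len(set(K1p) & set(K2p))        # sample intersection cell a_s
    b_s = len(K1p) - a_s                  # b_s = k1' - a_s
    c_s = len(K2p) - a_s                  # c_s = k2' - a_s
    d_s = Ds - a_s - b_s - c_s            # remaining documents among the first D_s
    return Ds, a_s, b_s, c_s, d_s

def mle_closed_form(a_s, b_s, c_s, f1, f2):
    """Closed-form estimate of the intersection cell a (claims 1, 5 and 10);
    assumes a non-degenerate sample, i.e. 2*a_s + b_s + c_s > 0."""
    t1 = f1 * (2 * a_s + c_s)
    t2 = f2 * (2 * a_s + b_s)
    disc = (t1 - t2) ** 2 + 4.0 * f1 * f2 * b_s * c_s
    return (t1 + t2 - disc ** 0.5) / (2.0 * (2 * a_s + b_s + c_s))

def mle_iterative(a_s, b_s, c_s, d_s, f1, f2, D):
    """Partial-likelihood estimate via the ratio g(a) of claim 9: increase a
    while g(a) > 1, assuming the likelihood is unimodal in a."""
    def g(a):
        return (a / (a - a_s)
                * (f1 - a + 1 - b_s) / (f1 - a + 1)
                * (f2 - a + 1 - c_s) / (f2 - a + 1)
                * (D - f1 - f2 + a) / (D - f1 - f2 + a - d_s))
    a = max(a_s, f1 + f2 + d_s - D)       # smallest feasible value of a
    a_max = min(f1 - b_s, f2 - c_s)       # largest feasible value of a
    while a < a_max and g(a + 1) > 1.0:
        a += 1
    return a

# Hypothetical usage on synthetic data: two postings lists over D documents
# with a planted overlap, and sketches of size k = 200.
random.seed(0)
D = 100000
docs = range(D)
shared = set(random.sample(docs, 1000))
P1 = sorted(shared | set(random.sample(docs, 3000)))
P2 = sorted(shared | set(random.sample(docs, 2000)))
f1, f2 = len(P1), len(P2)
K1, K2 = min_sketch(P1, 200), min_sketch(P2, 200)
Ds, a_s, b_s, c_s, d_s = sample_contingency_table(K1, K2)
print("true a =", len(set(P1) & set(P2)),
      "closed form =", round(mle_closed_form(a_s, b_s, c_s, f1, f2), 1),
      "iterative =", mle_iterative(a_s, b_s, c_s, d_s, f1, f2, D))

On a run like this synthetic one, the closed-form and iterative estimates would be expected to fall close to the true intersection count, since D greatly exceeds f₁ + f₂; the remaining cells of the estimated table then follow from the margins as b = f₁ − â, c = f₂ − â and d = D − f₁ − f₂ + â.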